
VocalBench-DF: A Benchmark for Evaluating Speech LLM Robustness to Disfluency

Liu, Hongcheng, Hou, Yixuan, Liu, Heyang, Wang, Yuhao, Wang, Yanfeng, Wang, Yu

arXiv.org Artificial Intelligence

While Speech Large Language Models (Speech-LLMs) show strong performance in many applications, their robustness is critically under-tested, especially to speech disfluency. Existing evaluations often rely on idealized inputs, overlooking common disfluencies, particularly those associated with conditions like Parkinson's disease. This work investigates whether current Speech-LLMs can maintain performance when interacting with users who have speech impairments. To facilitate this inquiry, we introduce VocalBench-DF, a framework for the systematic evaluation of disfluency across a multi-dimensional taxonomy. Our evaluation of 22 mainstream Speech-LLMs reveals substantial performance degradation, indicating that their real-world readiness is limited. Further analysis identifies phoneme-level processing and long-context modeling as the primary bottlenecks responsible for these failures. Strengthening recognition and reasoning capabilities at both the component and pipeline level can substantially improve robustness. These findings highlight the urgent need for new methods to improve disfluency handling and build truly inclusive Speech-LLMs.
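The robustness issue above can be made concrete with a small text-level sketch of disfluency injection. The categories used here (filled pauses, word repetitions) loosely mirror a disfluency taxonomy, but the specific perturbation rules and probabilities are illustrative assumptions; the actual VocalBench-DF perturbations operate on speech, not transcripts.

```python
import random

FILLERS = ["uh", "um"]  # filled pauses (illustrative list)

def inject_disfluencies(words, repeat_prob=0.3, filler_prob=0.3, seed=0):
    """Return a disfluent copy of `words` (a list of tokens)."""
    rng = random.Random(seed)
    out = []
    for w in words:
        if rng.random() < filler_prob:
            out.append(rng.choice(FILLERS))  # insert a filled pause
        out.append(w)
        if rng.random() < repeat_prob:
            out.append(w)                    # word repetition
    return out

clean = "please set a timer for ten minutes".split()
disfluent = inject_disfluencies(clean)
print(" ".join(disfluent))
```

Feeding both the clean and the perturbed variants of the same request to a model, and comparing task success, is the basic shape of a disfluency robustness check.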


A Good Plan is Hard to Find: Aligning Models with Preferences is Misaligned with What Helps Users

Balepur, Nishant, Shu, Matthew, Sung, Yoo Yeon, Goldfarb-Tarrant, Seraphina, Feng, Shi, Yang, Fumeng, Rudinger, Rachel, Boyd-Graber, Jordan Lee

arXiv.org Artificial Intelligence

To assist users in complex tasks, LLMs generate plans: step-by-step instructions towards a goal. While alignment methods aim to ensure LLM plans are helpful, they train (RLHF) or evaluate (ChatbotArena) on what users prefer, assuming this reflects what helps them. We test this with Planorama: an interface where 126 users answer 300 multi-step questions with LLM plans. We get 4388 plan executions and 5584 comparisons to measure plan helpfulness (QA success) and user preferences on plans, and recreate the setup in agents and reward models to see if they simulate or prefer what helps users. We expose: 1) user/model preferences and agent success do not accurately predict which plans help users, so common alignment feedback can misalign with helpfulness; 2) this gap is not due to user-specific preferences, as users are similarly successful when using plans they prefer/disprefer; 3) surface-level cues like brevity and question similarity strongly link to preferences, but such biases fail to predict helpfulness. In all, we argue aligning helpful LLMs needs feedback from real user interactions, not just preferences of what looks helpful, so we discuss the plan NLP researchers can execute to solve this problem.


Trump latest: Migration crackdown, DeepSeek's rise, what's ahead on Tuesday

Al Jazeera

United States President Donald Trump signed a series of executive orders on Monday aimed at reshaping military policies, including the removal of diversity, equity and inclusion programmes (DEI), reinstating service members discharged for refusing COVID-19 vaccines, and barring transgender people from military service. Earlier in the day, newly confirmed Secretary of Defense Pete Hegseth, who secured the position after a narrow Senate vote, said he would ensure the orders "are complied with rapidly and quickly". Here is the latest news from Monday and a look ahead for the week. Speaking with reporters on board Air Force One on Monday, Trump said that he signed four executive orders. Among those, Trump revealed he signed an order to establish a framework for developing what his administration calls an "American Iron Dome," a missile defence system designed to protect the homeland.


Trump to declare national emergency at border in flurry of day one orders

BBC News

In a series of calls with reporters on Monday morning, incoming Trump administration officials outlined dozens of executive orders the president-elect planned to take when he officially takes office, including 10 focused on what one official described as "common sense immigration policy". Officials said that Trump plans to end birthright citizenship, meaning that the children of undocumented migrants living in the US will no longer automatically be considered US citizens. Birthright citizenship, however, is enshrined in the US constitution and would require a two-thirds vote in both chambers of Congress to change. The official provided no further detail on how Trump plans to accomplish this. As part of the national emergency designation at the border, Trump will also direct the Department of Defense to "seal the border" and surge additional resources and personnel, including counter-drone capabilities.


Quantifying and Mitigating Unimodal Biases in Multimodal Large Language Models: A Causal Perspective

Chen, Meiqi, Cao, Yixin, Zhang, Yan, Lu, Chaochao

arXiv.org Artificial Intelligence

Recent advancements in Large Language Models (LLMs) have facilitated the development of Multimodal LLMs (MLLMs). Despite their impressive capabilities, MLLMs often suffer from an over-reliance on unimodal biases (e.g., language bias and vision bias), leading to incorrect answers in complex multimodal tasks. To investigate this issue, we propose a causal framework to interpret the biases in Visual Question Answering (VQA) problems. Within our framework, we devise a causal graph to elucidate the predictions of MLLMs on VQA problems, and assess the causal effect of biases through an in-depth causal analysis. Motivated by the causal graph, we introduce a novel MORE dataset, consisting of 12,000 VQA instances. This dataset is designed to challenge MLLMs' abilities, necessitating multi-hop reasoning and the surmounting of unimodal biases. Furthermore, we propose two strategies to mitigate unimodal biases and enhance MLLMs' reasoning capabilities, including a Decompose-Verify-Answer (DeVA) framework for limited-access MLLMs and the refinement of open-source MLLMs through fine-tuning. Extensive quantitative and qualitative experiments offer valuable insights for future research. Our project page is at https://opencausalab.github.io/MORE.
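The Decompose-Verify-Answer idea can be sketched as a loop over sub-questions. In the real framework each step queries a multimodal LLM; here `ask` is a stub lookup table standing in for model calls, and the questions and answers are made-up examples.

```python
# Hypothetical knowledge standing in for MLLM responses.
KB = {
    "What animal is in the image?": "zebra",
    "What continent do zebras live on?": "Africa",
}

def ask(question):
    return KB.get(question, "unknown")

def deva(subquestions, compose):
    # Decompose: answer each reasoning hop separately.
    answers = [ask(q) for q in subquestions]
    # Verify: re-ask and require stable, non-empty answers before composing.
    if any(ask(q) != a or a == "unknown" for q, a in zip(subquestions, answers)):
        return None
    # Answer: combine the verified hops into the final response.
    return compose(answers)

final = deva(
    ["What animal is in the image?", "What continent do zebras live on?"],
    lambda a: f"The {a[0]} lives in {a[1]}.",
)
print(final)  # The zebra lives in Africa.
```

The point of the decomposition is that each hop can fail (or be biased) independently, so verifying hops before composing reduces the chance that a unimodal shortcut answers the full question.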


Knowledge Graph Enhanced Large Language Model Editing

Zhang, Mengqi, Ye, Xiaotian, Liu, Qiang, Ren, Pengjie, Wu, Shu, Chen, Zhumin

arXiv.org Artificial Intelligence

Large language models (LLMs) are pivotal in advancing natural language processing (NLP) tasks, yet their efficacy is hampered by inaccuracies and outdated knowledge. Model editing emerges as a promising solution to address these challenges. However, existing editing methods struggle to track and incorporate changes in knowledge associated with edits, which limits the generalization ability of post-edit LLMs in processing edited knowledge. To tackle these problems, we propose a novel model editing method that leverages knowledge graphs for enhancing LLM editing, namely GLAME. Specifically, we first utilize a knowledge graph augmentation module to uncover associated knowledge that has changed due to editing, obtaining its internal representations within LLMs. This approach allows knowledge alterations within LLMs to be reflected through an external graph structure. Subsequently, we design a graph-based knowledge edit module to integrate structured knowledge into the model editing. This ensures that the updated parameters reflect not only the modifications of the edited knowledge but also the changes in other associated knowledge resulting from the editing process. Comprehensive experiments conducted on GPT-J and GPT-2 XL demonstrate that GLAME significantly improves the generalization capabilities of post-edit LLMs in employing edited knowledge.
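Why graph structure matters for editing can be shown with a toy multi-hop lookup: changing one fact should also change every fact that is only reachable through it. The triples and propagation below are an illustrative stand-in, not GLAME's actual subgraph construction over LLM representations.

```python
# A tiny knowledge graph as (entity, relation) -> entity.
kg = {
    ("Eiffel Tower", "located_in"): "Paris",
    ("Paris", "country"): "France",
}

def query(kg, entity, relations):
    """Follow a chain of relations from `entity` (multi-hop lookup)."""
    for rel in relations:
        entity = kg[(entity, rel)]
    return entity

# Edit: relocate the tower. The associated two-hop fact
# ("Eiffel Tower" -> located_in -> country) changes implicitly,
# because the query now routes through the edited node.
kg[("Eiffel Tower", "located_in")] = "Rome"
kg[("Rome", "country")] = "Italy"

print(query(kg, "Eiffel Tower", ["located_in", "country"]))  # Italy
```

A parameter-editing method without this graph view would update the one-hop fact but could leave the model answering "France" on the two-hop question, which is exactly the generalization gap the abstract describes.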


Symbol tuning improves in-context learning in language models

Wei, Jerry, Hou, Le, Lampinen, Andrew, Chen, Xiangning, Huang, Da, Tay, Yi, Chen, Xinyun, Lu, Yifeng, Zhou, Denny, Ma, Tengyu, Le, Quoc V.

arXiv.org Artificial Intelligence

We present symbol tuning - finetuning language models on in-context input-label pairs where natural language labels (e.g., "positive/negative sentiment") are replaced with arbitrary symbols (e.g., "foo/bar"). Symbol tuning leverages the intuition that when a model cannot use instructions or natural language labels to figure out a task, it must instead do so by learning the input-label mappings. We experiment with symbol tuning across Flan-PaLM models up to 540B parameters and observe benefits across various settings. First, symbol tuning boosts performance on unseen in-context learning tasks and is much more robust to underspecified prompts, such as those without instructions or without natural language labels. Second, symbol-tuned models are much stronger at algorithmic reasoning tasks, with up to 18.2% better performance on the List Functions benchmark and up to 15.3% better performance on the Simple Turing Concepts benchmark. Finally, symbol-tuned models show large improvements in following flipped labels presented in context, meaning that they are more capable of using in-context information to override prior semantic knowledge.
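Constructing a symbol-tuned example is a simple label-remapping step, sketched below. The symbol choices and prompt template here are assumptions for illustration; the paper samples symbols from a large pool and uses its own formatting.

```python
# Map natural language labels to arbitrary, semantically empty symbols.
SYMBOL_MAP = {"positive": "foo", "negative": "bar"}

def symbol_tune(examples):
    """Rewrite (text, label) demonstrations with arbitrary symbol labels."""
    return [(text, SYMBOL_MAP[label]) for text, label in examples]

def to_prompt(demos, query):
    """Format demonstrations plus a query into an in-context prompt."""
    lines = [f"Input: {t}\nLabel: {l}" for t, l in demos]
    return "\n\n".join(lines + [f"Input: {query}\nLabel:"])

demos = [("great movie", "positive"), ("waste of time", "negative")]
prompt = to_prompt(symbol_tune(demos), "loved every minute")
print(prompt)
```

Because "foo"/"bar" carry no semantic hint, a model finetuned on such prompts can only succeed by reading the mapping off the demonstrations, which is the behavior symbol tuning is designed to strengthen.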


Generating Data for Symbolic Language with Large Language Models

Ye, Jiacheng, Li, Chengzu, Kong, Lingpeng, Yu, Tao

arXiv.org Artificial Intelligence

While large language models (LLMs) bring not only performance but also complexity, recent work has started to turn LLMs into data generators rather than task inferencers, where another affordable task model is trained for efficient deployment and inference. However, such an approach has primarily been applied to natural language tasks and has not yet been explored for symbolic language tasks with complex structured outputs (e.g., semantic parsing and code generation). In this paper, we propose SymGen which utilizes LLMs for generating various annotation-expensive symbolic language data. SymGen consists of an informative prompt to steer generation and an agreement-based verifier to improve data correctness. We conduct extensive experiments on six symbolic language tasks across various settings. Compared with the LLMs, we demonstrate the 1%-sized task model can achieve comparable or better performance, largely cutting inference and deployment costs. We also show that generated data with only a few human demonstrations can be as effective as over 10 times the amount of human-annotated data when training the task model, saving a considerable amount of annotation effort. SymGen sheds new light on data generation for complex tasks, and we release the code at https://github.com/HKUNLP/SymGen.
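The agreement-based verification idea can be sketched executably: keep a generated symbolic annotation only if independently generated candidates agree once executed. Here the "programs" are arithmetic expressions and execution is a sandboxed eval, a deliberately simplified stand-in for executing semantic parses; the candidates are made up.

```python
from collections import Counter

def execute(program):
    """Execute a candidate symbolic annotation; None on failure."""
    try:
        return eval(program, {"__builtins__": {}})
    except Exception:
        return None

def agreement_verify(candidates, min_votes=2):
    """Return a candidate whose result is reached by >= min_votes candidates."""
    results = {p: execute(p) for p in candidates}
    votes = Counter(v for v in results.values() if v is not None)
    if not votes:
        return None
    value, count = votes.most_common(1)[0]
    if count < min_votes:
        return None
    return next(p for p, v in results.items() if v == value)

cands = ["2 * (3 + 4)", "2 * 3 + 2 * 4", "2 * 3 + 4"]  # two agree on 14
print(execute(agreement_verify(cands)))  # 14
```

The design choice is that agreement is measured on execution results, not surface strings, so syntactically different but semantically equivalent generations vote together.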


Shared Manifold Learning Using a Triplet Network for Multiple Sensor Translation and Fusion with Missing Data

Dutt, Aditya, Zare, Alina, Gader, Paul

arXiv.org Artificial Intelligence

Abstract--Heterogeneous data fusion can enhance the robustness and accuracy of an algorithm on a given task. However, due to the differences among modalities, aligning the sensors and embedding their information into discriminative and compact representations is challenging. In this paper, we propose a Contrastive learning based MultiModal Alignment Network (CoMMANet) to align data from different sensors into a shared and discriminative manifold where class information is preserved. The proposed architecture uses a multimodal triplet autoencoder to cluster the latent space so that samples of the same class from each heterogeneous modality are mapped close to each other. Since all modalities exist in a shared manifold, a unified classification framework is proposed, and a comparison with other methods demonstrates its superiority on tasks such as land-use and land-cover (LULC) classification [1], [2], mineral exploration [3], [4], urban planning [6], biodiversity conservation [7], and sentiment analysis. Fusion methods can be classified into two groups: concatenation-based and alignment-based methods. To increase the interpretability of fusion models, Hong et al. [27] proposed shared and specific feature learning (S2FL), which decomposes data into modality-shared and modality-specific components, enabling better information blending across multiple heterogeneous modalities; however, with that method, cloud-covered regions are not accurately classified because of the structured morphological element of predefined size and shape.
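The alignment objective behind a triplet autoencoder can be illustrated with the standard triplet margin loss: an anchor embedding is pulled toward a same-class sample from another modality (the positive) and pushed away from a different-class sample (the negative) by at least a margin. The 2-D vectors below are toy values, and this is the generic loss, not CoMMANet's full autoencoder objective.

```python
import math

def dist(a, b):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Zero once the negative is farther than the positive by >= margin."""
    return max(0.0, dist(anchor, positive) - dist(anchor, negative) + margin)

# Anchor from modality A; positive from modality B (same class);
# negative from modality B (different class).
anchor, pos, neg = (0.0, 0.0), (0.1, 0.0), (3.0, 0.0)
print(triplet_loss(anchor, pos, neg))  # 0.0: already separated beyond the margin
```

Minimizing this loss over cross-modal triplets is what drives same-class samples from heterogeneous sensors toward the same region of the shared manifold.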


Minimum-Distortion Embedding

Agrawal, Akshay, Ali, Alnur, Boyd, Stephen

arXiv.org Machine Learning

We consider the vector embedding problem. We are given a finite set of items, with the goal of assigning a representative vector to each one, possibly under some constraints (such as the collection of vectors being standardized, i.e., have zero mean and unit covariance). We are given data indicating that some pairs of items are similar, and optionally, some other pairs are dissimilar. For pairs of similar items, we want the corresponding vectors to be near each other, and for dissimilar pairs, we want the corresponding vectors to not be near each other, measured in Euclidean distance. We formalize this by introducing distortion functions, defined for some pairs of the items. Our goal is to choose an embedding that minimizes the total distortion, subject to the constraints. We call this the minimum-distortion embedding (MDE) problem. The MDE framework is simple but general. It includes a wide variety of embedding methods, such as spectral embedding, principal component analysis, multidimensional scaling, dimensionality reduction methods (like Isomap and UMAP), force-directed layout, and others. It also includes new embeddings, and provides principled ways of validating historical and new embeddings alike. We develop a projected quasi-Newton method that approximately solves MDE problems and scales to large data sets. We implement this method in PyMDE, an open-source Python package. In PyMDE, users can select from a library of distortion functions and constraints or specify custom ones, making it easy to rapidly experiment with different embeddings. Our software scales to data sets with millions of items and tens of millions of distortion functions. To demonstrate our method, we compute embeddings for several real-world data sets, including images, an academic co-author network, US county demographic data, and single-cell mRNA transcriptomes.
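A minimal instance of the MDE problem can be written out directly: assign a scalar embedding to each of four items, with a quadratic attractive distortion on similar pairs and a squared-hinge repulsive distortion on a dissimilar pair. PyMDE solves such problems at scale with the projected quasi-Newton method described above; this toy version uses plain gradient descent with numeric gradients, and the pairs and hyperparameters are made up for illustration.

```python
import random

similar = [(0, 1), (2, 3)]   # pairs we want close together
dissimilar = [(0, 2)]        # pair we want at least distance 1 apart

def total_distortion(x):
    """Sum of per-pair distortion functions over the embedding x."""
    loss = sum((x[i] - x[j]) ** 2 for i, j in similar)
    loss += sum(max(0.0, 1.0 - abs(x[i] - x[j])) ** 2 for i, j in dissimilar)
    return loss

def grad(x, eps=1e-6):
    """Central-difference numeric gradient of the total distortion."""
    g = []
    for k in range(len(x)):
        xp, xm = list(x), list(x)
        xp[k] += eps
        xm[k] -= eps
        g.append((total_distortion(xp) - total_distortion(xm)) / (2 * eps))
    return g

rng = random.Random(0)
x = [rng.uniform(-1, 1) for _ in range(4)]  # random initial embedding
for _ in range(500):
    x = [xi - 0.1 * gi for xi, gi in zip(x, grad(x))]

print([round(v, 2) for v in x])
```

At convergence the similar pairs collapse together while the dissimilar pair settles near the hinge boundary, which is the qualitative behavior the distortion-function framework is designed to produce.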